The dataset of choice for Project One was originally constructed by the World Health Organization (WHO) to track the health status of 193 countries across the world. The question we sought to answer is as follows:
What factors affect life expectancy in individuals across the world?
The dataset includes a variety of factors that contribute to the overall health status of a country. We wanted to know what effect these factors had specifically on life expectancy. The factors in this dataset are as follows:
• Country: Country
• Year: Year
• Status: Developed or Developing status
• Life expectancy: Life Expectancy in age
• Adult Mortality: Adult Mortality Rates of both sexes (probability of dying between 15 and 60 years per 1000 population)
• Infant deaths: Number of Infant Deaths per 1000 population
• Alcohol: Alcohol, recorded per capita (15+) consumption (in litres of pure alcohol)
• Percentage expenditure: Expenditure on health as a percentage of GDP per capita (%)
• Hepatitis B: HepB immunization coverage among 1-year-olds (%)
• Measles: Number of reported measles cases per 1000 population
• BMI: Average Body Mass Index of entire population
• Under-five deaths: Number of under-five deaths per 1000 population
• Polio: Pol3 immunization coverage among 1-year-olds (%)
• Total expenditure: General government expenditure on health as a percentage of total government expenditure (%)
• Diphtheria: DTP3 immunization coverage among 1-year-olds (%)
• HIV/AIDS: Deaths per 1 000 live births HIV/AIDS (0-4 years)
• GDP: Gross Domestic Product per capita (in USD)
• Population: Population of the country
• Thinness 10-19 years: Prevalence of thinness among children/adolescents, age 10 to 19 (%)
• Thinness 5-9 years: Prevalence of thinness among children, age 5 to 9 (%)
• Income comp of resources: HDI in terms of income composition of resources (index ranging from 0 to 1)
• Schooling: Number of years of Schooling (years)
Data Source: Kaggle
life2 <- na.omit(life) #omit missing values
loadPkg("ggplot2")
ggplot(life, aes(x=Life.expectancy) )+
geom_histogram(color="darkblue",fill="lightblue")+
ggtitle("Life Expectancy Histogram")+
xlab("Life Expectancy (Age)")
qqnorm(life$Life.expectancy, main="Life Expectancy Q-Q Plot", ylab="Life Expectancy (Age)")
qqline(life$Life.expectancy)
#ggplot(life2, aes(x=Total.expenditure, y=Life.expectancy, fill=Total.expenditure, group = 1)) + geom_boxplot() + scale_fill_brewer(palette="Spectral") + ggtitle("Life Expectancy vs. Total Expenditure") + ylab("Life Expectancy") + xlab("Total Expenditure ($)")
plot(life$Total.expenditure, life$Life.expectancy, main="Life Expectancy (y) vs. Govt Expenditure on Healthcare (x)",
xlab="% Healthcare Expenditure (out of total Govt spending)", ylab="Life Expectancy (Age)", pch=19)
# base graphics add layers through separate calls rather than with +
abline(lm(Life.expectancy ~ Total.expenditure, data=life), col="red") # regression line (y~x)
ggplot(life, aes(x=factor(Year), y=Life.expectancy))+
geom_boxplot() +
facet_wrap(~Status) +
theme(axis.text.x = element_text(angle = 90)) +
ylab("Life Expectancy (Age)")
#sapply(life, mean, na.rm=TRUE) # excluding missing values
#sapply(life, sd)
summary(life)
To determine the relationship between the numerical variables and ‘Life Expectancy’, a correlation analysis was performed. The variables ‘Country’ and ‘Status’ were removed prior to performing the correlation, as they are factor variables and will be addressed later. Additionally, rows that were blank or contained “NA” were ignored, as the vectors being correlated must be of equal length with no missing entries.
The resulting correlation matrix can be viewed below. The five variables most strongly correlated with ‘Life Expectancy’ were ‘Adult Mortality’, ‘Income Composition of Resources’, ‘Schooling’, ‘HIV/AIDS’, and ‘BMI’. The correlation coefficients for these variables were -0.70, 0.73, 0.75, -0.56, and 0.57, respectively. The negative sign on the coefficients for ‘Adult Mortality’ and ‘HIV/AIDS’ indicates that life expectancy is higher where adult mortality and HIV/AIDS deaths are lower. It is important to note that the variable ‘Population’ had virtually no correlation with ‘Life Expectancy’, with a correlation coefficient of -0.02.
library(dplyr)
sapply(life, class) #look at the class of each variables
life_nofactor = select(life, -c(Country, Status)) #remove the factor variables
cor_life=cor(life_nofactor,use = "complete.obs") #create correlation matrix
library(corrplot)
corrplot(cor_life, type="lower") #plot correlation matrix
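The ranking described above can be read off the correlation matrix directly. The following is a minimal sketch on made-up data (the `toy` data frame and its values are assumptions for illustration, not the WHO data); it shows the effect of `use = "complete.obs"` and one way to sort variables by the absolute size of their correlation with life expectancy.

```r
# Toy data standing in for the numeric columns of `life` (values are made up).
set.seed(1)
n <- 200
toy <- data.frame(Schooling = rnorm(n, 12, 3))
toy$Life.expectancy <- 50 + 1.5 * toy$Schooling + rnorm(n, 0, 4)
toy$Adult.Mortality <- 400 - 4 * toy$Life.expectancy + rnorm(n, 0, 30)
toy$Population <- rnorm(n, 1e7, 1e6)
toy$Schooling[c(3, 17)] <- NA  # a few missing values, as in the real data

# With the default use = "everything", an NA in a column makes its correlations NA.
cor_default <- cor(toy)

# use = "complete.obs" drops every row that contains an NA before correlating.
cor_complete <- cor(toy, use = "complete.obs")

# Correlations with life expectancy, ranked by absolute value.
le_cor <- cor_complete[, "Life.expectancy"]
le_cor <- le_cor[names(le_cor) != "Life.expectancy"]
ranked <- le_cor[order(abs(le_cor), decreasing = TRUE)]
print(round(ranked, 2))
```

Sorting on the absolute value keeps strong negative relationships (such as adult mortality) near the top instead of pushing them to the bottom of the list.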
ggplot(life, aes(x=GDP, y=percentage.expenditure)) +
geom_point() +
geom_smooth() +
ylab("Expenditure on health as a % of GDP per capita(%)") +
xlab("Gross Domestic Product (GDP per capita in USD)") +
ggtitle("Health Expenditure (y) vs GDP (x)")
ggplot(life, aes(x=HIV.AIDS, y=Life.expectancy)) +
geom_point() +
geom_smooth() +
ylab("Life Expectancy (Age)") +
xlab("Deaths due to HIV/AIDS per 1000 live births (for 0-4 years)")+
ggtitle("Life Expectancy (y) vs Deaths due to HIV/AIDS (x)")
ggplot(life, aes(x=Schooling, y=Life.expectancy)) +
geom_point() +
geom_smooth() +
ylab("Life Expectancy (Age)") +
xlab("Years of Schooling")+
ggtitle("Life Expectancy (y) vs Years of Schooling (x)")
The relationship between ‘Adult Mortality’ and ‘Life Expectancy’ was not explored further in this EDA because it offers little new information: a higher adult mortality rate implies a lower life expectancy almost by definition. The other variables, however, are explored in further detail.
The factor variables that were removed earlier were ‘Country’ and ‘Status’. ‘Status’ classifies each country as either ‘Developed’ or ‘Developing’. To determine whether life expectancy differed between ‘Developed’ and ‘Developing’ countries, a t-test was performed. It was found that the average life expectancy in the developed countries was 78.7 years and that in the developing countries was 67.7 years. The t-test found the mean life expectancies to be significantly different, with a p-value of less than 2e-16. To visually discern the difference in life expectancies, histograms (one showing actual counts and the other showing proportions) were generated. As can be seen, the life expectancies of developed countries are grouped toward the higher (right-hand) side of the histogram, while those of developing countries are, on average, much lower.
#summary(life)
life_developed=na.omit(subset(life, Status=='Developed')) #subset of countries that are Developed
life_developing=na.omit(subset(life, Status=='Developing')) #subset of countries that are Developing
#Life expectancy of Developed/Developing are different as p<<0.05
mean_life_developed = mean(life_developed$Life.expectancy)
mean_life_developing = mean(life_developing$Life.expectancy)
t_life_developed = t.test(x=life_developed$Life.expectancy, conf.level=0.95 )
t_life_developing= t.test(x=life_developing$Life.expectancy, conf.level=0.95 )
t_life_developed$conf.int
t_life_developing$conf.int
mean_life_developed
mean_life_developing
ttest_status = t.test(life_developed$Life.expectancy, life_developing$Life.expectancy)
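The two-sample result is stored in `ttest_status` but never printed. As a hedged sketch of how the reported means and p-value can be pulled out of such an object, the example below runs the same kind of test on made-up samples (the vectors `developed` and `developing` are assumptions, not the WHO data).

```r
# Made-up samples standing in for the two Status groups.
set.seed(42)
developed  <- rnorm(60,  mean = 79, sd = 4)
developing <- rnorm(240, mean = 68, sd = 9)

# With two samples and default arguments, t.test() runs Welch's test,
# which does not assume equal variances in the two groups.
tt <- t.test(developed, developing)

group_means <- unname(tt$estimate)  # mean of x, then mean of y
diff_ci     <- tt$conf.int          # 95% CI for the difference in means
p_value     <- tt$p.value

print(tt$method)
print(round(group_means, 1))
print(diff_ci)
print(format.pval(p_value))
```

The same accessors (`$estimate`, `$conf.int`, `$p.value`) apply to `ttest_status` and to the one-sample results computed earlier in this chunk.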
#make overlaying histograms of the two subgroups (actual counts)
# First distribution
hist(life_developing$Life.expectancy, col=rgb(1,0,0,0.5), xlab="Life Expectancy",
ylab="Count", main="Life Expectancy of Developed vs Developing Countries-Actual Count" )
# Second with add=T to plot on top
hist(life_developed$Life.expectancy, col=rgb(0,0,1,0.5), add=T)
# Add legend
legend("topright", legend=c("Developing","Developed"), col=c(rgb(1,0,0,0.5),
rgb(0,0,1,0.5)), pt.cex=2, pch=15 )
#Histogram Plot of Developed vs Developing using relative frequency
#make overlaying histograms of the two subgroups
# First distribution
hist(life_developing$Life.expectancy, col=rgb(1,0,0,0.5), xlab="Life Expectancy",
ylab="Proportion", main="Life Expectancy of Developed vs Developing Countries-Relative Frequencies" , freq=F,
ylim = c(0,0.15))
# Second with add=T to plot on top
hist(life_developed$Life.expectancy, col=rgb(0,0,1,0.5), add=T, freq=F)
# Add legend
legend("topright", legend=c("Developing","Developed"), col=c(rgb(1,0,0,0.5),
rgb(0,0,1,0.5)), pt.cex=2, pch=15 )
#t-test of developed vs developing of variables found to have significantly high correlations with Life Expectancy
ttest_income = t.test(life_developed$Income.composition.of.resources, life_developing$Income.composition.of.resources)
ttest_school = t.test(life_developed$Schooling, life_developing$Schooling)
ttest_BMI = t.test(life_developed$BMI, life_developing$BMI)
ttest_AIDS = t.test(life_developed$HIV.AIDS, life_developing$HIV.AIDS)
ttest_income
ttest_school
ttest_BMI
ttest_AIDS
#Table of t-test results (values copied from the t-test output above)
ttest_table = matrix(c("0.836", "0.596", "<2e-16", "15.6", "11.5", "<2e-16", "52.3", "35.7", "<2e-16", "0.10", "2.31", "<2e-16"), ncol=3, byrow=TRUE)
colnames(ttest_table)=c("Mean Value Developed", "Mean Value Developing", "P-value")
rownames(ttest_table)=c("Income level, from 0 to 1", "Years of Schooling", "BMI", "AIDS, per 1000 people")
library(knitr)
kable(ttest_table, caption="T-test Between Developed and Developing Countries for Variables Highly Correlated to Life Expectancy")
| | Mean Value Developed | Mean Value Developing | P-value |
|---|---|---|---|
| Income level, from 0 to 1 | 0.836 | 0.596 | <2e-16 |
| Years of Schooling | 15.6 | 11.5 | <2e-16 |
| BMI | 52.3 | 35.7 | <2e-16 |
| AIDS, per 1000 people | 0.10 | 2.31 | <2e-16 |
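The table above was typed in by hand from the t-test output, which makes transcription slips easy. A possible alternative, sketched here on made-up data, is to build the table from the test objects themselves; the helper `summarise_ttest()` and the `toy` data frame are hypothetical names introduced only for this illustration.

```r
# Made-up data with the same layout as the relevant columns of `life`.
set.seed(7)
toy <- data.frame(
  Status    = rep(c("Developed", "Developing"), times = c(50, 150)),
  Schooling = c(rnorm(50, 15, 2),  rnorm(150, 11, 3)),
  BMI       = c(rnorm(50, 50, 15), rnorm(150, 36, 19))
)

# Hypothetical helper: run a Welch t-test for one variable and
# return the group means and p-value as a one-row data frame.
summarise_ttest <- function(df, variable) {
  x  <- df[[variable]][df$Status == "Developed"]
  y  <- df[[variable]][df$Status == "Developing"]
  tt <- t.test(x, y)
  data.frame(
    Variable        = variable,
    Mean.Developed  = unname(tt$estimate[1]),
    Mean.Developing = unname(tt$estimate[2]),
    P.value         = format.pval(tt$p.value, digits = 3)
  )
}

vars <- c("Schooling", "BMI")
ttest_summary <- do.call(rbind, lapply(vars, function(v) summarise_ttest(toy, v)))
print(ttest_summary)
```

The resulting data frame can be passed to `kable()` in place of the hand-built matrix, so the displayed numbers always match the tests that were run.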
ggplot(life, aes(x=Year, y=Life.expectancy, col=Status))+
geom_point() +
geom_smooth() +
facet_wrap(~Status)
data <- read.csv("lifeexpectancydata.csv")
data_world_map <- data.frame(data)
levels(data_world_map$Country) <- c(levels(data_world_map$Country), 'USA')
data_world_map$Country[data_world_map$Country == 'United States of America'] <- 'USA'
levels(data_world_map$Country) <- c(levels(data_world_map$Country), 'UK')
data_world_map$Country[data_world_map$Country == 'United Kingdom of Great Britain and Northern Ireland'] <- 'UK'
levels(data_world_map$Country) <- c(levels(data_world_map$Country), 'Russia')
data_world_map$Country[data_world_map$Country == 'Russian Federation'] <- 'Russia'
levels(data_world_map$Country) <- c(levels(data_world_map$Country), 'Bolivia')
data_world_map$Country[data_world_map$Country == 'Bolivia (Plurinational State of)'] <- 'Bolivia'
levels(data_world_map$Country) <- c(levels(data_world_map$Country), 'Brunei')
data_world_map$Country[data_world_map$Country == 'Brunei Darussalam'] <- 'Brunei'
levels(data_world_map$Country) <- c(levels(data_world_map$Country), 'Czech Republic')
data_world_map$Country[data_world_map$Country == 'Czechia'] <- 'Czech Republic'
levels(data_world_map$Country) <- c(levels(data_world_map$Country), 'North Korea')
data_world_map$Country[data_world_map$Country == 'Democratic People\'s Republic of Korea'] <- 'North Korea'
levels(data_world_map$Country) <- c(levels(data_world_map$Country), 'Iran')
data_world_map$Country[data_world_map$Country == 'Iran (Islamic Republic of)'] <- 'Iran'
levels(data_world_map$Country) <- c(levels(data_world_map$Country), 'Laos')
data_world_map$Country[data_world_map$Country == 'Lao People\'s Democratic Republic'] <- 'Laos'
levels(data_world_map$Country) <- c(levels(data_world_map$Country), 'Micronesia')
data_world_map$Country[data_world_map$Country == 'Micronesia (Federated States of)'] <- 'Micronesia'
levels(data_world_map$Country) <- c(levels(data_world_map$Country), 'South Korea')
data_world_map$Country[data_world_map$Country == 'Republic of Korea'] <- 'South Korea'
levels(data_world_map$Country) <- c(levels(data_world_map$Country), 'Moldova')
data_world_map$Country[data_world_map$Country == 'Republic of Moldova'] <- 'Moldova'
levels(data_world_map$Country) <- c(levels(data_world_map$Country), 'Saint Vincent')
data_world_map$Country[data_world_map$Country == 'Saint Vincent and the Grenadines'] <- 'Saint Vincent'
levels(data_world_map$Country) <- c(levels(data_world_map$Country), 'Syria')
data_world_map$Country[data_world_map$Country == 'Syrian Arab Republic'] <- 'Syria'
levels(data_world_map$Country) <- c(levels(data_world_map$Country), 'Macedonia')
data_world_map$Country[data_world_map$Country == 'The former Yugoslav republic of Macedonia'] <- 'Macedonia'
levels(data_world_map$Country) <- c(levels(data_world_map$Country), 'Tanzania')
data_world_map$Country[data_world_map$Country == 'United Republic of Tanzania'] <- 'Tanzania'
levels(data_world_map$Country) <- c(levels(data_world_map$Country), 'Venezuela')
data_world_map$Country[data_world_map$Country == 'Venezuela (Bolivarian Republic of)'] <- 'Venezuela'
levels(data_world_map$Country) <- c(levels(data_world_map$Country), 'Vietnam')
data_world_map$Country[data_world_map$Country == 'Viet Nam'] <- 'Vietnam'
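The renaming above repeats the same two statements for every country. One more compact way to express the same mapping, shown here as a sketch rather than a replacement for the code above, is a named lookup vector; `country_lookup` (only three of the pairs are listed) and `recode_countries()` are hypothetical names used for illustration.

```r
# Named vector: names are the WHO spellings, values are the map spellings.
country_lookup <- c(
  "United States of America" = "USA",
  "Russian Federation"       = "Russia",
  "Viet Nam"                 = "Vietnam"
)

# Hypothetical helper: replace any name found in the lookup, leave the rest alone.
recode_countries <- function(x, lookup) {
  x   <- as.character(x)          # accepts factor or character input
  hit <- x %in% names(lookup)
  x[hit] <- unname(lookup[x[hit]])
  x
}

toy_names <- c("Viet Nam", "France", "Russian Federation", "United States of America")
recoded   <- recode_countries(toy_names, country_lookup)
print(recoded)
```

Because the helper converts its input with `as.character()`, it behaves the same whether `read.csv()` returned the `Country` column as a factor or as plain strings.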
two_thousand <- subset(x=data_world_map, data_world_map$Year==2000)
loadPkg("ggmap")
loadPkg("tidyverse")
loadPkg("dplyr")
world_data <- map_data('world')
# Keep every map region and attach life expectancy by country name (regions without a match get NA)
combined_2000 <- world_data
combined_2000$value <- two_thousand$Life.expectancy[match(combined_2000$region, two_thousand$Country)]
ggplot(data=combined_2000, aes(x=long, y=lat, group=group, fill=value)) +
geom_polygon(colour="white") +
scale_fill_continuous(low="blue",
high="orange",
guide="colorbar") +
theme_bw() +
labs(fill="Life Expectancy", title="Life Expectancy Across the World (2000)", x="", y="") +
scale_y_continuous(breaks=c()) +
scale_x_continuous(breaks=c()) +
theme(panel.border=element_blank())
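The map is coloured by looking up each polygon's `region` in the `Country` column with `match()`. A small sketch with made-up rows (the data frames `map_toy` and `life_toy` and their values are assumptions for illustration) shows how that lookup behaves and why country names that differ between the two sources end up as missing values on the map.

```r
# A few made-up polygon vertices, shaped like the output of map_data('world').
map_toy <- data.frame(
  region = c("USA", "USA", "Vietnam", "Greenland"),
  long   = c(-100, -99, 106, -42),
  lat    = c(40, 41, 16, 72)
)

# Made-up life expectancy rows; Tanzania is deliberately left un-recoded.
life_toy <- data.frame(
  Country         = c("Vietnam", "USA", "United Republic of Tanzania"),
  Life.expectancy = c(73.4, 76.8, 49.2)
)

# match() returns, for each map row, the position of its region in life_toy$Country.
idx <- match(map_toy$region, life_toy$Country)
map_toy$value <- life_toy$Life.expectancy[idx]
print(map_toy)

# Countries in the data that never appear as a map region.
unmatched <- setdiff(life_toy$Country, map_toy$region)
print(unmatched)
```

Running `setdiff()` in this way against the real `region` column is a quick check for any country names that still need recoding.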
loadPkg("ggmap")
loadPkg("tidyverse")
loadPkg("dplyr")
two_thousand_five <- subset(x=data_world_map, data_world_map$Year==2005)
# Keep every map region and attach life expectancy by country name (regions without a match get NA)
combined_2005 <- world_data
combined_2005$value <- two_thousand_five$Life.expectancy[match(combined_2005$region, two_thousand_five$Country)]
ggplot(data=combined_2005, aes(x=long, y=lat, group=group, fill=value)) +
geom_polygon(colour="white") +
scale_fill_continuous(low="blue",
high="orange",
guide="colorbar") +
theme_bw() +
labs(fill="Life Expectancy", title="Life Expectancy Across the World (2005)", x="", y="") +
scale_y_continuous(breaks=c()) +
scale_x_continuous(breaks=c()) +
theme(panel.border=element_blank())
loadPkg("ggmap")
loadPkg("tidyverse")
loadPkg("dplyr")
two_thousand_ten <- subset(x=data_world_map, data_world_map$Year==2010)
# Keep every map region and attach life expectancy by country name (regions without a match get NA)
combined_2010 <- world_data
combined_2010$value <- two_thousand_ten$Life.expectancy[match(combined_2010$region, two_thousand_ten$Country)]
ggplot(data=combined_2010, aes(x=long, y=lat, group=group, fill=value)) +
geom_polygon(colour="white") +
scale_fill_continuous(low="blue",
high="orange",
guide="colorbar") +
theme_bw() +
labs(fill="Life Expectancy", title="Life Expectancy Across the World (2010)", x="", y="") +
scale_y_continuous(breaks=c()) +
scale_x_continuous(breaks=c()) +
theme(panel.border=element_blank())
# Combining and plotting LE data from the year 2015
loadPkg("ggmap")
loadPkg("tidyverse")
loadPkg("dplyr")
two_thousand_fifteen <- subset(x=data_world_map, data_world_map$Year==2015)
# Keep every map region and attach life expectancy by country name (regions without a match get NA)
combined_2015 <- world_data
combined_2015$value <- two_thousand_fifteen$Life.expectancy[match(combined_2015$region, two_thousand_fifteen$Country)]
ggplot(data=combined_2015, aes(x=long, y=lat, group=group, fill=value)) +
geom_polygon(colour="white") +
scale_fill_continuous(low="blue",
high="orange",
guide="colorbar") +
theme_bw() +
labs(fill="Life Expectancy", title="Life Expectancy Across the World (2015)", x="", y="") +
scale_y_continuous(breaks=c()) +
scale_x_continuous(breaks=c()) +
theme(panel.border=element_blank())
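The four yearly maps differ only in the year being selected. A hedged sketch of how the data preparation could be wrapped in a single function is given below; `attach_life_expectancy()` is a hypothetical helper, and `map_toy` and `life_toy` are made-up stand-ins for `world_data` and `data_world_map`.

```r
# Hypothetical helper: subset one year and attach its values to the map rows.
attach_life_expectancy <- function(map_df, life_df, year) {
  one_year <- life_df[life_df$Year == year, ]
  map_df$value <- one_year$Life.expectancy[match(map_df$region, one_year$Country)]
  map_df$year  <- year
  map_df
}

# Made-up inputs with the same column names as the real data.
map_toy <- data.frame(
  region = c("A", "A", "B", "C"),
  long   = c(0, 1, 10, 20),
  lat    = c(0, 1, 10, 20)
)
life_toy <- data.frame(
  Country         = rep(c("A", "B"), each = 2),
  Year            = rep(c(2000, 2015), times = 2),
  Life.expectancy = c(60, 65, 70, 74)
)

years   <- c(2000, 2015)
by_year <- lapply(years, function(y) attach_life_expectancy(map_toy, life_toy, y))
names(by_year) <- years

# Stack the per-year tables into one long data frame.
all_years <- do.call(rbind, by_year)
print(all_years)
```

With the stacked data frame, a single `ggplot()` call with `facet_wrap(~year)` could draw all years as panels sharing one colour scale, which makes the maps easier to compare than four separately scaled figures.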
loadPkg("dplyr")
data_le_means <- data_world_map # same data, with country names already recoded above to match map_data('world')
le_clean <- filter(data_le_means, !is.na(Life.expectancy)) # drop rows with no recorded life expectancy
country_and_status <- le_clean %>% group_by(Country, Status) %>% summarise(mean_le = mean(Life.expectancy))
loadPkg("ggmap")
loadPkg("tidyverse")
loadPkg("dplyr")
world_data <- map_data('world')
# Keep every map region and attach the mean life expectancy by country name (regions without a match get NA)
combined_means <- world_data
combined_means$value <- country_and_status$mean_le[match(combined_means$region, country_and_status$Country)]
ggplot(data=combined_means, aes(x=long, y=lat, group=group, fill=value)) +
geom_polygon(colour="white") +
scale_fill_continuous(low="blue",
high="orange",
guide="colorbar") +
theme_bw() +
labs(fill="Life Expectancy", title="Mean Life Expectancy Across the World (2000-2015)", x="", y="") +
scale_y_continuous(breaks=c()) +
scale_x_continuous(breaks=c()) +
theme(panel.border=element_blank())